Statistical inference, part I

Eva Freyhult

NBIS, SciLifeLab

2022-09-13

Statistical inference

Statistical inference is to draw conclusions regarding properties of a population based on observations of a random sample from the population.

Hypothesis test

To perform a hypothesis test is to evaluate a hypothesis based on a random sample.

Typically, the hypotheses that are tested are assumptions about properties of a population, such as proportion, mean, mean difference, variance etc.

The null and alternative hypothesis

There are two hypotheses involved in a hypothesis test, the null hypothesis, \(H_0\), and the alternative hypothesis, \(H_1\).

\(H_0\), the null hypothesis, is in general neutral: “no change”, “no difference between groups”, “no association”.

In general we want to show that \(H_0\) is false.

\(H_1\), the alternative hypothesis, expresses what the researcher is interested in: “the treatment has an effect”, “there is a difference between groups”, “there is an association”.

The alternative hypothesis can also be directional “the treatment has a positive effect”.

Error types

             H0 is true                  H0 is false
Accept H0    correct                     Type II error, miss
Reject H0    Type I error, false alarm   correct

Type I error is a false positive, a false alarm that occurs when \(H_0\) is rejected when it is actually true. Examples: “The test says that you are covid-19 positive, when you actually are not”, “The test says that the drug has a positive effect on patient symptoms, but it actually has not”.

Type II error is a false negative, a miss that occurs when \(H_0\) is accepted when it is actually false. Examples: “The test says that you are covid-19 negative, when you actually have covid-19”, “The test says that the drug has no effect on patient symptoms, when it actually has”.

Significance level

             H0 is true                  H0 is false
Accept H0    correct                     Type II error, miss
Reject H0    Type I error, false alarm   correct

The significance level, \(\alpha\) = P(false alarm) = P(Reject \(H_0\)| \(H_0\) is true).

The significance level should be set before the hypothesis test is performed!

Common values of \(\alpha\) are 0.05 or 0.01.

Statistical power

             H0 is true                  H0 is false
Accept H0    correct                     Type II error, miss
Reject H0    Type I error, false alarm   correct

Another property of a statistical test is its statistical power:

\[\mbox{power} = P(\mbox{Reject } H_0 | H_0 \mbox{ is false}).\]

To perform a hypothesis test

  1. Define \(H_0\) and \(H_1\).
  2. Select an appropriate test statistic, \(T\), and compute the observed value, \(t_{obs}\).
  3. Assume that \(H_0\) is true and compute the sampling distribution of \(T\).
  4. Select an appropriate significance level, \(\alpha\).
  5. Compare the observed value, \(t_{obs}\), with the computed sampling distribution under \(H_0\) and compute a p-value. The p-value is the probability of observing a value at least as extreme as the observed value, if \(H_0\) is true.
  6. Based on the p-value, either accept or reject \(H_0\).

Null distribution

A sampling distribution is the distribution of a sample statistic. The sampling distribution can be obtained by drawing a large number of samples from a specific population.

The null distribution is a sampling distribution when the null hypothesis is true.

A null distribution

p-value

The p-value is the probability of the observed value, or something more extreme, if the null hypothesis is true.


p-value

If the p-value is above the significance level, \(H_0\) is accepted.

If the p-value is below the significance level, \(H_0\) is rejected.

Simulation example

Does a high-fat diet lead to increased body weight?

Study setup:

  1. Order 24 female mice from a lab.
  2. Randomly assign 12 of the 24 mice to receive a high-fat diet; the remaining 12 are controls (ordinary diet).
  3. Measure body weight after one week.

The observed values, mouse weights in grams, are summarized below:

high-fat: 25 30 23 18 31 24 39 26 36 29 23 32
ordinary: 27 25 22 23 25 37 24 26 21 26 30 24

Simulation example

1. Null and alternative hypotheses

\[ \begin{aligned} H_0: \mu_2 = \mu_1 \iff \mu_2 - \mu_1 = 0\\ H_1: \mu_2>\mu_1 \iff \mu_2-\mu_1 > 0 \end{aligned} \]

where \(\mu_2\) is the (unknown) mean body weight of the high-fat mouse population and \(\mu_1\) is the mean body-weight of the control mouse population.

Studied population: Female mice that can be ordered from a lab.

Simulation example

2. Select appropriate significance level \(\alpha\)

\[\alpha = 0.05\]

Simulation example

3. Test statistic

Of interest is the mean difference between the high-fat and control mice:

\[D = \bar X_2 - \bar X_1\]

Mean weight of 12 (randomly selected) mice on ordinary diet, \(\bar X_1\). \(E[\bar X_1] = E[X_1] = \mu_1\)

Mean weight of 12 (randomly selected) mice on high-fat diet, \(\bar X_2\). \(E[\bar X_2] = E[X_2] = \mu_2\)

Observed values:

\(\bar x_1 = 25.83\), mean weight of control mice (ordinary diet)

\(\bar x_2 = 28.00\), mean weight of mice on high-fat diet

\(d_{obs} = \bar x_2 - \bar x_1 = 2.1667\), difference in mean weights
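These summaries can be reproduced directly from the data above; a minimal sketch in Python (variable names are our own):

```python
# Observed mouse weights (grams) from the study setup above
high_fat = [25, 30, 23, 18, 31, 24, 39, 26, 36, 29, 23, 32]
ordinary = [27, 25, 22, 23, 25, 37, 24, 26, 21, 26, 30, 24]

xbar2 = sum(high_fat) / len(high_fat)  # mean weight, high-fat diet
xbar1 = sum(ordinary) / len(ordinary)  # mean weight, ordinary diet
d_obs = xbar2 - xbar1                  # observed test statistic

print(round(xbar1, 2), round(xbar2, 2), round(d_obs, 4))  # 25.83 28.0 2.1667
```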

Simulation example

4. Null distribution

If a high-fat diet has no effect, i.e. if \(H_0\) were true, the result would be as if all mice were given the same diet.

The 24 mice come from the same population; depending on how they are randomly assigned to the high-fat and control groups, the mean weights will differ somewhat, even if the two groups are treated the same.

Random reassignment to two groups can be accomplished using permutation.

Assume \(H_0\) is true, i.e. assume all mice are equivalent, and

  1. randomly reassign 12 of the 24 mice to ‘high-fat’ and the remaining 12 to ‘control’,
  2. compute the difference in mean weights.

Repeating steps 1–2 many times gives the sampling distribution of the mean-weight difference when \(H_0\) is true, the so-called null distribution.
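The permutation scheme can be sketched in Python (a sketch for illustration; the number of permutations and the seed are our own choices):

```python
import random

# Observed weights (grams) from the study above
high_fat = [25, 30, 23, 18, 31, 24, 39, 26, 36, 29, 23, 32]
ordinary = [27, 25, 22, 23, 25, 37, 24, 26, 21, 26, 30, 24]

d_obs = sum(high_fat) / 12 - sum(ordinary) / 12  # observed mean difference

pooled = high_fat + ordinary  # under H0, all 24 mice are equivalent
random.seed(1)

null_dist = []
for _ in range(100_000):
    random.shuffle(pooled)                             # 1. random reassignment
    d = sum(pooled[:12]) / 12 - sum(pooled[12:]) / 12  # 2. difference in means
    null_dist.append(d)

# Permutation p-value: fraction of reassignments at least as extreme as d_obs
p = sum(d >= d_obs for d in null_dist) / len(null_dist)
```

With this many permutations, `p` should land close to the 0.17 reported later in this example.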


Simulation example

5. Compute p-value

What is the probability of obtaining a mean difference at least as extreme as the observed value, \(d_{obs}\), if \(H_0\) were true?

\(P(\bar X_2 - \bar X_1 \geq d_{obs} | H_0) = 0.17\)

Simulation example

6. Conclusion?

The p-value, 0.17, is above the significance level \(\alpha = 0.05\), so \(H_0\) is accepted; the experiment does not provide evidence that a high-fat diet increases body weight.

Multiple testing

Error types

             Accept H0             Reject H0
H0 is true   correct               Type I error, false alarm
H0 is false  Type II error, miss   correct

Multiple testing

Error types

             Accept H0             Reject H0
H0 is true   correct               Type I error, false alarm
H0 is false  Type II error, miss   correct

             Accept H0   Reject H0
H0 is true   TN          FP
H0 is false  FN          TP


Significance level

\[P(\mbox{reject }\,H_0 | H_0 \,\mbox{is true}) = P(\mbox{type I error}) = \alpha\]

Statistical power

\[P(\mbox{reject } H_0 | H_1 \mbox{ is true}) = P(\mbox{reject } H_0 | H_0 \mbox{ is false}) = 1 - P(\mbox{type II error})\]

Multiple tests

Perform one test:

  • P(One type I error) = \(\alpha\)
  • P(No type I error) = \(1 - \alpha\)

Perform \(m\) independent tests:

  • P(No type I errors in \(m\) tests) = \((1 - \alpha)^m\)
  • P(At least one type I error in \(m\) tests) = \(1 - (1 - \alpha)^m\)
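The probability of at least one false alarm grows quickly with the number of tests; a small numeric illustration (using \(\alpha = 0.05\)):

```python
alpha = 0.05
for m in (1, 5, 10, 100):
    # P(at least one type I error in m independent tests)
    p_any = 1 - (1 - alpha) ** m
    print(f"m = {m:3d}: {p_any:.3f}")
```

Already at 100 independent tests, at least one false alarm is almost certain (probability about 0.994).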

Multiple test correction

  • FWER: family-wise error rate; controls the probability of one or more false positives, e.g. Bonferroni, Holm
  • FDR: false discovery rate; controls the expected proportion of false positives among the hits, e.g. Benjamini-Hochberg, Storey

Bonferroni correction

To achieve a family-wise error rate of \(\leq \alpha\) when performing \(m\) tests, declare significance and reject the null hypothesis for any test with \(p \leq \alpha/m\).
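As a sketch (the function name and example p-values are our own):

```python
def bonferroni_reject(pvals, alpha=0.05):
    """Reject H0 for every test with p <= alpha/m, giving FWER <= alpha."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

# With m = 3 tests, the per-test threshold is 0.05/3 ~ 0.0167
print(bonferroni_reject([0.001, 0.010, 0.040]))  # [True, True, False]
```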

Objection: too conservative, especially when the number of tests is large.

Benjamini-Hochberg’s FDR

             H0 is true   H0 is false
Accept H0    TN           FN
Reject H0    FP           TP

The false discovery rate is the proportion of false positives among ‘hits’, i.e. \(\frac{FP}{TP+FP}\).

Benjamini-Hochberg’s method controls the FDR at level \(\gamma\), when performing \(m\) independent tests, as follows:

  1. Sort the p-values \(p_1 \leq p_2 \leq \dots \leq p_m\).
  2. Find the maximum \(j\) such that \(p_j \leq \gamma \frac{j}{m}\).
  3. Declare significance for all tests \(1, 2, \dots, j\).
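The three steps above can be sketched as follows (function name and example p-values are our own):

```python
def bh_reject(pvals, gamma=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level gamma."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # 1. sort the p-values
    j_max = 0
    for j, i in enumerate(order, start=1):            # 2. largest j with p_j <= gamma*j/m
        if pvals[i] <= gamma * j / m:
            j_max = j
    reject = [False] * m
    for j, i in enumerate(order, start=1):            # 3. declare tests 1..j significant
        if j <= j_max:
            reject[i] = True
    return reject

print(bh_reject([0.03, 0.001, 0.2, 0.02]))  # rejects the three smallest p-values
```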

‘Adjusted’ p-values

Sometimes an adjusted significance threshold is not reported, but instead ‘adjusted’ p-values are reported.

  • Using Bonferroni’s method the ‘adjusted’ p-values are:

    \(\tilde p_i = \min(m p_i, 1)\).

A feature’s adjusted p-value represents the smallest FWER at which the null hypothesis will be rejected, i.e. the feature will be deemed significant.

  • Benjamini-Hochberg’s ‘adjusted’ p-values are called \(q\)-values:

    \(q_i = \min(\frac{m}{i} p_i, 1)\)

    A feature’s \(q\)-value can be interpreted as the lowest FDR at which the corresponding null hypothesis will be rejected, i.e. the feature will be deemed significant.
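Both kinds of ‘adjusted’ p-values can be computed from the formulas above. A sketch, assuming the input p-values are already sorted; note that standard implementations additionally enforce that \(q\)-values are non-decreasing in \(i\) (a step-up correction), which is included here:

```python
def adjust_pvalues(pvals, m=None):
    """Bonferroni-adjusted p-values and B-H q-values.

    pvals must be sorted increasingly; m is the total number of tests
    (defaults to len(pvals))."""
    m = m if m is not None else len(pvals)
    bonf = [min(m * p, 1.0) for p in pvals]
    q = [min(m / i * p, 1.0) for i, p in enumerate(pvals, start=1)]
    for i in range(len(q) - 2, -1, -1):  # enforce q_1 <= q_2 <= ... (step-up)
        q[i] = min(q[i], q[i + 1])
    return bonf, q
```

For instance, with \(m = 10000\) and the two smallest p-values 1.7e-08 and 5.8e-08, this gives Bonferroni-adjusted values 0.0002 and 0.0006, and q-values 0.0002 and 0.0003.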

Example, 10000 independent tests (e.g. genes)

p-value   adj p (Bonferroni)   q-value (B-H)
1.7e-08   0.0002               0.0002
5.8e-08   0.0006               0.0003
3.4e-07   0.0034               0.0011
9.1e-07   0.0091               0.0020
1e-06     0.0100               0.0020
2.4e-06   0.0240               0.0040
2.3e-05   0.2300               0.0329
3.6e-05   0.3600               0.0450
0.00022   1.0000               0.2300
0.00023   1.0000               0.2300
0.00073   1.0000               0.6636
0.0032    1.0000               1.0000
0.0045    1.0000               1.0000
0.0087    1.0000               1.0000
0.0089    1.0000               1.0000
0.012     1.0000               1.0000
0.014     1.0000               1.0000
0.045     1.0000               1.0000
0.08      1.0000               1.0000
0.23      1.0000               1.0000